This report explores a dataset containing price, certification, and 9 physical attributes for approximately 597,000 diamonds. The dataset was created by Solomon Messing in 2014, and can be found here.
## [1] 597311 11
## carat cut color clarity
## Min. :0.200 Ideal :369346 G :96053 SI1 :116468
## 1st Qu.:0.500 V.Good:168550 F :93452 VS2 :110997
## Median :0.900 Good : 59415 E :93374 SI2 :104104
## Mean :1.072 H :86555 VS1 : 97677
## 3rd Qu.:1.500 D :73563 VVS2 : 65480
## Max. :9.250 I :70213 VVS1 : 54790
## (Other):84101 (Other): 47795
## table depth cert price
## Min. : 0.00 Min. : 0.00 GIA :463066 Min. : 300
## 1st Qu.:56.00 1st Qu.:61.00 IGI : 43497 1st Qu.: 1220
## Median :58.00 Median :62.10 EGL : 33770 Median : 3503
## Mean :57.63 Mean :61.06 EGL USA : 16070 Mean : 8753
## 3rd Qu.:59.00 3rd Qu.:62.70 EGL Intl. : 11447 3rd Qu.:11174
## Max. :75.90 Max. :81.30 EGL ISRAEL: 11301 Max. :99990
## (Other) : 18160
## x y z
## Min. : 0.150 Min. : 1.000 Min. : 0.040
## 1st Qu.: 4.740 1st Qu.: 4.970 1st Qu.: 3.120
## Median : 5.780 Median : 6.050 Median : 3.860
## Mean : 5.993 Mean : 6.201 Mean : 4.035
## 3rd Qu.: 6.970 3rd Qu.: 7.230 3rd Qu.: 4.610
## Max. :13.890 Max. :13.890 Max. :13.180
## NA's :1814 NA's :1851 NA's :2543
Interestingly, peaks occur at each integer value up to 7 carats, and similarly visible peaks occur at the half-carat values from 0.5 to 3.5 carats. This may be due to cultural stigma about purchasing a diamond under a certain weight, or buyers may prefer to purchase a diamond of lesser cut, color or clarity that meets or exceeds these carat values.
Carat summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.200 0.500 0.900 1.072 1.500 9.250
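One way to check for these peaks without a plot is to tabulate the carat weights and see which values occur most often. The sketch below uses a small made-up vector called `carat` rather than the real data, so the counts are illustrative only.

```r
# Hypothetical carat weights; the real analysis would use the carat column of the dataset.
carat <- c(0.30, 0.50, 0.50, 0.50, 0.72, 0.99, 1.00, 1.00, 1.00, 1.00,
           1.01, 1.20, 1.50, 1.50, 1.50, 1.70, 2.00, 2.00, 2.00, 2.30)

# Count how many diamonds share each exact weight, most common first.
carat_counts <- sort(table(carat), decreasing = TRUE)

# The four most frequent weights.
top_weights <- as.numeric(names(carat_counts)[1:4])

# Flag weights that sit exactly on a whole or half carat.
# Working in hundredths of a carat avoids floating point surprises.
on_round_weight <- round(carat * 100) %% 50 == 0
share_round <- mean(on_round_weight)
```

In this toy vector the four most common weights are 0.5, 1, 1.5 and 2 carats, which is the pattern the histogram shows at a much larger scale.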
Transforming prices from a linear to a log scale gives a better view of the shape of my data. There are two distinct peaks, one around $800 and a second around $12,500. It appears the market for diamonds is actually two separate markets, one for diamonds priced up to ~$12,500, and a second for diamonds priced from $12,500 on up.
Price summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 300 1220 3503 8753 11170 99990
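As a rough sketch of the log transformation, the code below bins `log10(price)` instead of `price` and reports the tallest bin in dollars. The prices are simulated from two log-normal clusters purely for illustration; the cluster centers are borrowed from the peaks described above and are not a fitted model.

```r
# Hypothetical prices drawn from two log-normal clusters; not the real data.
set.seed(42)
price <- c(rlnorm(500, meanlog = log(800), sdlog = 0.4),
           rlnorm(500, meanlog = log(12500), sdlog = 0.4))

# Bin log10(price) instead of price so both clusters are visible.
log_price <- log10(price)
h <- hist(log_price, breaks = 30, plot = FALSE)

# Midpoint of the bin with the highest count, converted back to dollars.
peak_price <- 10^h$mids[which.max(h$counts)]
```

Setting `plot = FALSE` returns the bin counts without drawing anything, which is handy for checking where the peaks sit numerically.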
The cut quality of the diamonds is heavily skewed toward the top grade, with most diamonds having a cut quality of ‘Ideal’.
Cut summary statistics:
## Ideal V.Good Good
## 369346 168550 59415
Similarly, color quality (letters earlier in the alphabet are better) is skewed toward the better grades. It appears most consumers are satisfied with a diamond of color H or better.
Color summary statistics:
## D E F G H I J K L
## 73563 93374 93452 96053 86555 70213 48645 25807 9649
Clarity seems to have a threshold of SI2: relatively few diamonds for jewelry purposes are sold below this clarity. More than half of all diamonds sold have a clarity of at least VS2. Surprisingly, more than 5% of all diamonds are considered Internally Flawless (IF). Additionally, more diamonds are classified IF than are classified I1 and I2 combined.
Clarity summary statistics:
## IF VVS1 VVS2 VS1 VS2 SI1 SI2 I1 I2 I3
## 31156 54790 65480 97677 110997 116468 104104 14355 2284 0
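The shares quoted above can be checked directly from the printed counts. This is a minimal sketch that copies the clarity counts from the table into a named vector; the variable names are my own.

```r
# Clarity counts copied from the table above, ordered best to worst.
clarity_counts <- c(IF = 31156, VVS1 = 54790, VVS2 = 65480, VS1 = 97677,
                    VS2 = 110997, SI1 = 116468, SI2 = 104104,
                    I1 = 14355, I2 = 2284, I3 = 0)

total <- sum(clarity_counts)

# Share of each grade and the running share of "this grade or better".
share <- clarity_counts / total
cum_share <- cumsum(share)

round(100 * cum_share["VS2"], 1)   # percent with clarity of at least VS2
round(100 * share["IF"], 1)        # percent graded IF
clarity_counts["IF"] > clarity_counts["I1"] + clarity_counts["I2"]
```

With these counts, roughly 60% of diamonds are VS2 or better and about 5.2% are IF, consistent with the statements above.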
GIA certifies the vast majority of the diamonds included in this dataset, far more than all other certification agencies combined. I wonder if the different certification agencies specialize in different types of diamonds. Which agency has the highest proportion of low-quality diamonds? Which agency has the highest median price? Do you get more diamond for your money from some agencies?
Certification summary statistics:
## GIA IGI EGL EGL USA EGL Intl. EGL ISRAEL
## 463066 43497 33770 16070 11447 11301
## HRD AGS OTHER
## 9936 2958 5266
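The three questions above map onto simple grouped summaries. The sketch below shows one possible way to compute them with `tapply`. The data frame `diamonds_demo` and all of its values are invented, and treating I1 or worse as "low quality" is my assumption rather than a definition from this report.

```r
# Hypothetical rows; agency names match the table above, values are made up.
diamonds_demo <- data.frame(
  cert  = c("GIA", "GIA", "GIA", "IGI", "IGI", "EGL", "EGL", "EGL"),
  carat = c(1.00, 0.50, 2.00, 1.00, 0.70, 1.00, 1.50, 0.40),
  price = c(6000, 1500, 20000, 4000, 2100, 3500, 6000, 600),
  clarity = c("VS1", "SI1", "VVS2", "SI2", "I1", "I1", "SI2", "I2"),
  stringsAsFactors = FALSE
)

# Median price per agency.
median_price <- tapply(diamonds_demo$price, diamonds_demo$cert, median)

# Median price per carat, a rough "diamond for your money" measure.
diamonds_demo$price_per_carat <- diamonds_demo$price / diamonds_demo$carat
median_ppc <- tapply(diamonds_demo$price_per_carat, diamonds_demo$cert, median)

# Proportion of each agency's stones graded I1 or worse (assumed cutoff).
low_quality <- diamonds_demo$clarity %in% c("I1", "I2", "I3")
low_share <- tapply(low_quality, diamonds_demo$cert, mean)
```

Price per carat ignores cut, color and clarity, so it is only a first pass at the value-for-money question.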
There are 8,066 diamonds with a depth of 0, and another 663 diamonds with a depth between 0% and 10%. Looking at the summary statistics below, these values may be a data entry error. Perhaps the data is off by an order of magnitude?
I find it interesting that >12% of all IGI certified diamonds fall into this potential error case. I’ll pay attention to this certification agency going forward to see if any other discrepancies arise.
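To put the IGI figure in context, the per-agency rate can be computed from the counts printed in this report: the flagged counts come from the summary of low-depth diamonds, and the totals come from the certification table. The vector names are my own, and the small "(Other)" group is left out because its members are not listed.

```r
# Diamonds with a depth below 10, per agency (from the low-depth summary).
flagged <- c(IGI = 5284, GIA = 2099, HRD = 700, OTHER = 283,
             "EGL USA" = 269, "EGL Intl." = 66)

# All diamonds per agency (from the certification table).
all_certified <- c(IGI = 43497, GIA = 463066, HRD = 9936, OTHER = 5266,
                   "EGL USA" = 16070, "EGL Intl." = 11447)

# Percentage of each agency's diamonds that fall into the suspect group.
flagged_pct <- round(100 * flagged / all_certified[names(flagged)], 2)
sort(flagged_pct, decreasing = TRUE)
```

IGI has the highest rate at a little over 12%, while GIA, despite contributing the second largest number of flagged rows, sits below 1%.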
## carat cut color clarity
## Min. :0.2000 Ideal :3984 H :1320 SI2 :1494
## 1st Qu.:0.4200 V.Good:3981 G :1289 VS2 :1393
## Median :0.7000 Good : 764 I :1258 IF :1240
## Mean :0.9308 F :1143 SI1 :1234
## 3rd Qu.:1.0100 E :1135 VS1 :1203
## Max. :6.6900 D : 913 VVS2 :1013
## (Other):1671 (Other):1152
## table depth cert price
## Min. : 0.00 Min. :0.00000 IGI :5284 Min. : 320
## 1st Qu.: 0.00 1st Qu.:0.00000 GIA :2099 1st Qu.: 1030
## Median :56.50 Median :0.00000 HRD : 700 Median : 2018
## Mean :36.84 Mean :0.09293 OTHER : 283 Mean : 7060
## 3rd Qu.:59.00 3rd Qu.:0.00000 EGL USA : 269 3rd Qu.: 6020
## Max. :69.00 Max. :7.80000 EGL Intl.: 66 Max. :99640
## (Other) : 28
## x y z
## Min. : 0.690 Min. : 1.000 Min. : 1.000
## 1st Qu.: 4.830 1st Qu.: 4.820 1st Qu.: 2.980
## Median : 5.620 Median : 5.610 Median : 3.530
## Mean : 5.861 Mean : 5.864 Mean : 3.673
## 3rd Qu.: 6.400 3rd Qu.: 6.450 3rd Qu.: 4.030
## Max. :11.940 Max. :12.040 Max. :10.240
## NA's :311 NA's :313 NA's :340
Here is a histogram excluding diamonds with a depth of less than 10%.
Depth summary statistics (depth > 10%):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.10 61.10 62.10 61.97 62.70 81.30
The sweet spot for diamond depth is between 60% and 65%.
We see a similar issue as above: 2,981 diamonds have a table value of 0%, while another 598 diamonds have a table value between 0% and 10%. Looking deeper at these data points reveals that all other variables are populated, with the exception of depth. Again, IGI appears to be represented at a higher rate than would be expected.
Below we see the summary stats for diamonds with a table of < 10%:
## carat cut color clarity
## Min. :0.2000 Ideal :1742 G :682 SI2 :932
## 1st Qu.:0.4100 V.Good:1415 H :630 SI1 :706
## Median :0.7000 Good : 422 F :616 VS2 :592
## Mean :0.9016 E :569 VS1 :531
## 3rd Qu.:1.0200 I :439 VVS2 :285
## Max. :8.0300 D :334 VVS1 :207
## (Other):309 (Other):326
## table depth cert price
## Min. :0.0000 Min. : 0.000 GIA :1722 Min. : 320
## 1st Qu.:0.0000 1st Qu.: 0.000 IGI :1008 1st Qu.: 1050
## Median :0.0000 Median : 0.000 OTHER : 338 Median : 2200
## Mean :0.1284 Mean : 5.607 EGL USA : 304 Mean : 6522
## 3rd Qu.:0.0000 3rd Qu.: 0.000 HRD : 129 3rd Qu.: 5686
## Max. :6.3000 Max. :75.200 EGL Intl.: 54 Max. :97120
## (Other) : 24
## x y z
## Min. : 0.690 Min. : 1.000 Min. :1.000
## 1st Qu.: 4.770 1st Qu.: 4.770 1st Qu.:2.980
## Median : 5.600 Median : 5.610 Median :3.560
## Mean : 5.773 Mean : 5.807 Mean :3.675
## 3rd Qu.: 6.420 3rd Qu.: 6.470 3rd Qu.:4.040
## Max. :12.760 Max. :12.870 Max. :9.870
## NA's :315 NA's :315 NA's :318
Table summary statistics (>10% table):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 56.00 58.00 57.97 59.00 75.90
The sweet spot for table is between 55% and 60%.
1,814 diamonds have an n/a value for the x-axis measurement; otherwise they seem to be mostly complete entries. Below you will find the summary statistics for these diamonds.
## carat cut color clarity table
## Min. :0.2300 Ideal :1026 F :412 SI1 :360 Min. : 0.0
## 1st Qu.:0.4000 V.Good: 594 G :353 VS2 :330 1st Qu.:56.0
## Median :0.7000 Good : 194 E :298 SI2 :298 Median :57.0
## Mean :0.8577 H :297 VS1 :279 Mean :47.9
## 3rd Qu.:1.0200 D :184 VVS2 :221 3rd Qu.:58.0
## Max. :5.5400 I :125 VVS1 :215 Max. :68.0
## (Other):145 (Other):111
## depth cert price x
## Min. : 0.00 GIA :1453 Min. : 502 Min. : NA
## 1st Qu.:60.40 IGI : 110 1st Qu.: 1123 1st Qu.: NA
## Median :61.90 OTHER : 95 Median : 2294 Median : NA
## Mean :51.61 HRD : 86 Mean : 5428 Mean :NaN
## 3rd Qu.:62.80 EGL USA: 32 3rd Qu.: 5818 3rd Qu.: NA
## Max. :70.20 EGL : 31 Max. :81180 Max. : NA
## (Other): 7 NA's :1814
## y z
## Min. : 3.210 Min. :2.480
## 1st Qu.: 4.935 1st Qu.:2.880
## Median : 5.585 Median :3.330
## Mean : 6.072 Mean :3.496
## 3rd Qu.: 7.372 3rd Qu.:3.950
## Max. :10.740 Max. :7.460
## NA's :1782 NA's :869
1851 diamonds contain n/a values for y-axis measurement. Below are the summary statistics for these diamonds.
## carat cut color clarity
## Min. :0.2300 Ideal :1048 F :414 SI1 :367
## 1st Qu.:0.4000 V.Good: 606 G :353 VS2 :331
## Median :0.7000 Good : 197 H :323 SI2 :303
## Mean :0.8688 E :296 VS1 :280
## 3rd Qu.:1.0300 D :186 VVS2 :244
## Max. :5.5400 I :127 VVS1 :212
## (Other):152 (Other):114
## table depth cert price
## Min. : 0.00 Min. : 0.00 GIA :1473 Min. : 502
## 1st Qu.:56.00 1st Qu.:60.40 IGI : 115 1st Qu.: 1130
## Median :57.00 Median :61.90 OTHER : 95 Median : 2284
## Mean :48.08 Mean :51.79 HRD : 87 Mean : 5536
## 3rd Qu.:58.00 3rd Qu.:62.80 EGL USA: 38 3rd Qu.: 5868
## Max. :68.00 Max. :70.20 EGL : 27 Max. :81180
## (Other): 16
## x y z
## Min. : 4.610 Min. : NA Min. :0.340
## 1st Qu.: 5.120 1st Qu.: NA 1st Qu.:2.880
## Median : 5.970 Median : NA Median :3.340
## Mean : 6.588 Mean :NaN Mean :3.514
## 3rd Qu.: 7.580 3rd Qu.: NA 3rd Qu.:3.980
## Max. :10.840 Max. : NA Max. :7.510
## NA's :1782 NA's :1851 NA's :891
2543 diamonds contain n/a values for the z-axis measurement. Below are the summary statistics for these diamonds.
## carat cut color clarity table
## Min. :0.200 Ideal :1687 G :501 VS2 :407 Min. : 0.00
## 1st Qu.:0.550 V.Good: 672 H :442 VS1 :397 1st Qu.:56.00
## Median :0.900 Good : 184 F :439 SI1 :384 Median :57.00
## Mean :1.104 E :348 VVS1 :375 Mean :50.47
## 3rd Qu.:1.500 I :272 VVS2 :337 3rd Qu.:58.00
## Max. :5.540 D :264 SI2 :298 Max. :68.00
## (Other):277 (Other):345
## depth cert price x
## Min. : 0.00 GIA :2255 Min. : 350 Min. : 2.000
## 1st Qu.:60.70 OTHER : 111 1st Qu.: 1990 1st Qu.: 5.530
## Median :62.10 IGI : 64 Median : 4292 Median : 6.115
## Mean :54.04 HRD : 46 Mean : 9693 Mean : 6.401
## 3rd Qu.:62.80 EGL : 36 3rd Qu.:11882 3rd Qu.: 7.350
## Max. :70.20 EGL USA: 18 Max. :94849 Max. :10.330
## (Other): 13 NA's :869
## y z
## Min. : 3.100 Min. : NA
## 1st Qu.: 5.590 1st Qu.: NA
## Median : 6.170 Median : NA
## Mean : 6.464 Mean :NaN
## 3rd Qu.: 7.410 3rd Qu.: NA
## Max. :10.380 Max. : NA
## NA's :891 NA's :2543
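The three summaries above overlap heavily (for example, 1,782 of the 1,814 diamonds missing x are also missing y), so it can help to count missing values per column and per row in one pass. This is a small sketch on an invented data frame named `dims`, not the real data.

```r
# Hypothetical rows standing in for the x, y, z columns of the real data frame.
dims <- data.frame(
  x = c(4.7, NA, 5.8, NA, 6.9),
  y = c(4.9, NA, 6.0, 7.2, NA),
  z = c(3.1, NA, NA, 4.6, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(dims))

# Rows missing every measurement versus rows missing at least one.
missing_all <- sum(rowSums(is.na(dims)) == ncol(dims))
missing_any <- sum(!complete.cases(dims))
```

On the real data, `missing_any` would give the number of rows affected by any of the three n/a cases, which is smaller than the sum of the three per-column counts.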
Given the size of this dataset, I feel comfortable dropping any diamonds with a table or depth value less than 10%, or with an n/a value for x, y or z. I will be using this revised dataset for the remainder of this analysis. This leaves me with 585,088 diamonds to examine. I’ve saved the ‘dirty’ dataset of 12,223 diamonds to analyze in more depth below.
Summary statistics for this cleaned dataset are below:
## carat cut color clarity
## Min. :0.200 Ideal :363083 G :94104 SI1 :114673
## 1st Qu.:0.500 V.Good:163616 E :91705 VS2 :109006
## Median :0.900 Good : 58389 F :91665 SI2 :102130
## Mean :1.074 H :84610 VS1 : 95922
## 3rd Qu.:1.500 D :72289 VVS2 : 63986
## Max. :9.250 I :68613 VVS1 : 53417
## (Other):82102 (Other): 45954
## table depth cert price
## Min. :13.00 Min. :30.10 GIA :458110 Min. : 300
## 1st Qu.:56.00 1st Qu.:61.10 IGI : 37976 1st Qu.: 1220
## Median :58.00 Median :62.10 EGL : 33722 Median : 3539
## Mean :57.97 Mean :61.97 EGL USA : 15711 Mean : 8776
## 3rd Qu.:59.00 3rd Qu.:62.70 EGL Intl. : 11371 3rd Qu.:11242
## Max. :75.90 Max. :81.30 EGL ISRAEL: 11271 Max. :99990
## (Other) : 16927
## x y z
## Min. : 0.150 Min. : 1.430 Min. : 0.040
## 1st Qu.: 4.740 1st Qu.: 4.980 1st Qu.: 3.120
## Median : 5.780 Median : 6.070 Median : 3.870
## Mean : 5.993 Mean : 6.205 Mean : 4.041
## 3rd Qu.: 6.970 3rd Qu.: 7.240 3rd Qu.: 4.620
## Max. :13.890 Max. :13.890 Max. :13.180
##
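A minimal sketch of the cleaning step follows, assuming the rule is "table and depth of at least 10, and no n/a in x, y or z". That reading matches the cleaned summary above, where table and depth have minimums of 13 and 30.1 and the n/a rows are gone while small x, y and z values remain. The data frame `raw` and the names `cleaned_demo` and `dirty_demo` are invented for illustration.

```r
# Hypothetical raw rows; column names follow the summaries above.
raw <- data.frame(
  table = c(57, 0, 58, 59, 56, 6.3),
  depth = c(62.1, 61.0, 0, 62.7, 61.5, 60.2),
  x = c(5.78, 4.74, 6.97, NA, 5.10, 5.60),
  y = c(6.05, 4.97, 7.23, 6.10, 5.15, 5.61),
  z = c(3.86, 3.12, 4.61, 3.90, 3.20, 3.56)
)

# Keep rows with plausible table and depth values and complete x, y, z.
keep <- raw$table >= 10 & raw$depth >= 10 &
  !is.na(raw$x) & !is.na(raw$y) & !is.na(raw$z)

cleaned_demo <- raw[keep, ]   # rows used for the rest of the analysis
dirty_demo <- raw[!keep, ]    # rows set aside for a closer look
```

Building a logical `keep` vector first makes it easy to save both halves and to confirm that their row counts add up to the original total.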
Below I look at diamonds that have Ideal cut, a color of D, and a clarity of IF.
It is hard to see much of what is going on in this plot, so I’ve replotted it below using a base 10 log scale for price.
Here are the summary statistics for the best quality diamonds. The most fascinating bit to me is that the max size is 2.58 carats vs 9.25 carats for the heaviest stone in the complete dataset.
## carat cut color clarity
## Min. :0.2000 Ideal :4639 D :4639 IF :4639
## 1st Qu.:0.4400 V.Good: 0 E : 0 VVS1 : 0
## Median :1.0200 Good : 0 F : 0 VVS2 : 0
## Mean :0.9451 G : 0 VS1 : 0
## 3rd Qu.:1.2900 H : 0 VS2 : 0
## Max. :2.5800 I : 0 SI1 : 0
## (Other): 0 (Other): 0
## table depth cert price
## Min. :53.00 Min. :56.20 GIA :4372 Min. : 435
## 1st Qu.:56.00 1st Qu.:60.80 IGI : 234 1st Qu.: 2165
## Median :57.00 Median :61.70 HRD : 18 Median :19740
## Mean :57.57 Mean :61.46 AGS : 4 Mean :19557
## 3rd Qu.:59.00 3rd Qu.:62.20 OTHER : 4 3rd Qu.:29070
## Max. :62.50 Max. :63.80 EGL : 3 Max. :99458
## (Other): 4
## x y z
## Min. :2.290 Min. :3.720 Min. :1.500
## 1st Qu.:4.700 1st Qu.:4.895 1st Qu.:3.100
## Median :6.430 Median :6.480 Median :4.000
## Mean :5.972 Mean :6.096 Mean :3.862
## 3rd Qu.:6.970 3rd Qu.:7.000 3rd Qu.:4.350
## Max. :8.930 Max. :8.930 Max. :8.240
##
Now let’s check out the other end of the spectrum. What do the lowest quality diamonds look like?
I’m intrigued that the median and mean carat values for the worst quality diamonds match up almost exactly with those of the best quality diamonds. Both are right around 1 carat!
summary(subset(cleaned, cut =='Good' & color == 'L' & clarity == 'I2'))
## carat cut color clarity table
## Min. :0.520 Ideal : 0 L :19 I2 :19 Min. :55.00
## 1st Qu.:0.650 V.Good: 0 D : 0 IF : 0 1st Qu.:58.50
## Median :0.970 Good :19 E : 0 VVS1 : 0 Median :60.00
## Mean :1.043 F : 0 VVS2 : 0 Mean :60.47
## 3rd Qu.:1.260 G : 0 VS1 : 0 3rd Qu.:62.00
## Max. :2.000 H : 0 VS2 : 0 Max. :67.00
## (Other): 0 (Other): 0
## depth cert price x
## Min. :59.30 GIA :13 Min. : 410.0 Min. :5.110
## 1st Qu.:61.00 EGL : 6 1st Qu.: 726.5 1st Qu.:5.510
## Median :62.80 IGI : 0 Median :1365.0 Median :6.170
## Mean :63.09 EGL USA : 0 Mean :1855.9 Mean :6.257
## 3rd Qu.:64.90 EGL Intl. : 0 3rd Qu.:2233.0 3rd Qu.:6.785
## Max. :67.40 EGL ISRAEL: 0 Max. :5957.0 Max. :8.000
## (Other) : 0
## y z
## Min. :5.050 Min. :3.220
## 1st Qu.:5.570 1st Qu.:3.440
## Median :6.110 Median :3.870
## Mean :6.242 Mean :3.942
## 3rd Qu.:6.730 3rd Qu.:4.310
## Max. :8.000 Max. :4.970
##
After cleaning my dataset of nonsensical and n/a values, 585,088 diamonds remain with 11 features (carat, cut, color, clarity, table, depth, certification agency, price, x, y, and z). The variables cut, color and clarity are ordered factor variables with the following levels:
(worst) ————> (best)
Cut: Good, Very Good, Ideal
Color: L, K, J, I, H, G, F, E, D
Clarity: I3, I2, I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF
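For reference, ordered factors with these levels can be built in base R as sketched below. The level order is taken from the list above; the label `V.Good` follows the printed summaries, and the three example diamonds are made up.

```r
# Level order from worst to best, as listed above.
cut_levels <- c("Good", "V.Good", "Ideal")
color_levels <- c("L", "K", "J", "I", "H", "G", "F", "E", "D")
clarity_levels <- c("I3", "I2", "I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

# Hypothetical values for three diamonds.
cut <- factor(c("Ideal", "Good", "V.Good"), levels = cut_levels, ordered = TRUE)
color <- factor(c("D", "L", "G"), levels = color_levels, ordered = TRUE)
clarity <- factor(c("IF", "I2", "VS2"), levels = clarity_levels, ordered = TRUE)

# Ordered factors support comparisons against a grade.
color >= "H"       # is the color H or better?
clarity >= "VS2"   # is the clarity at least VS2?
max(cut)           # best cut among the three
```

Declaring the order once means statements such as "H or better" or "at least VS2" can be written as plain comparisons instead of long lists of level names.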
Other observations:
More diamonds are of Ideal cut than are of the other two cuts combined.
Median carat size is 0.900.
Most diamonds are color G or above.
75% of all diamonds in my cleaned dataset are 1.5 carat or less.
Median price is $3,539 with a high value of $99,990.
The features that are most interesting as output values for a model are carat and price. I’m interested in diving into the differences between the certification agencies. Do some agencies specialize in lesser-quality diamonds? Additionally, I’m excited to look at the spike in the number of diamonds sold with weights at or above integer-valued carats. I am intrigued to see if diamonds of lesser cut, clarity or color are kept bigger to sell at or above these integer values.
## price carat table depth x y z
## price 1.00 0.86 0.03 -0.08 0.72 0.80 0.64
## carat 0.86 1.00 0.05 -0.05 0.86 0.96 0.79
## table 0.03 0.05 1.00 -0.47 0.04 0.06 0.03
## depth -0.08 -0.05 -0.47 1.00 -0.06 -0.09 -0.01
## x 0.72 0.86 0.04 -0.06 1.00 0.89 0.48
## y 0.80 0.96 0.06 -0.09 0.89 1.00 0.82
## z 0.64 0.79 0.03 -0.01 0.48 0.82 1.00
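A matrix like the one above can be produced with `cor` on the numeric columns and rounded to two decimals. The sketch below assumes Pearson correlation (the default for `cor`), since the report does not say which method was used, and it runs on simulated columns rather than the real data.

```r
# Hypothetical numeric columns; the real call would use the cleaned data frame.
set.seed(1)
carat <- runif(200, 0.2, 3)
num_demo <- data.frame(
  price = 4000 * carat^2 + rnorm(200, sd = 500),
  carat = carat,
  depth = rnorm(200, mean = 62, sd = 1.5)
)

# Pearson correlations between every pair of columns, rounded to two decimals.
cor_matrix <- round(cor(num_demo), 2)
cor_matrix
```

If any column still contained n/a values, `cor` would return `NA` for the affected pairs unless a `use` argument such as `"complete.obs"` were supplied.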
The following plots examine the relationship between price and carat weight for diamonds. First we will look at the overall density plot. Note the large vertical streaks that occur at carat weights ending with .50 and .00. I will be referring to this preference for diamonds to exceed a ‘round’ carat weight as a vanity metric.
Let’s start by looking at price vs carat by certification agency to see if any agencies have specialties. GIA dominates the certification market in sheer numbers as well as price. EGL USA and IGI both bring less of a price premium on diamonds they certify.
Here is the same plot colored by the diamond’s color. It does appear that a lot of 2 and 3 carat diamonds of lesser color were allowed to come to market relative to other weights.
Now looking at the same plot by cut quality: not many non-ideal cuts are allowed to come to market.
Similar to color, we observe more diamonds of lesser quality coming to market at 1, 1.5, 2 and 3 carat weights.
Now I’m getting really interested in the behaviour of diamond prices vs. quality at the vanity points. Below we can start to observe how prices change with differing color and/or clarity at the vanity points. Additionally, notice how few diamonds are offered in the low quality quadrant. Perhaps diamonds with these ratings are used industrially instead of for jewelry?
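A numeric companion to these plots is to keep only the diamonds that sit on a vanity point and compare median prices across quality grades. The sketch below does this for color using `aggregate`. The data frame `vanity_demo` and its prices are invented, and the set of vanity points (1, 1.5, 2 and 3 carats) is taken from the weights mentioned above.

```r
# Hypothetical diamonds; values are invented for illustration.
vanity_demo <- data.frame(
  carat = c(1.00, 1.00, 1.00, 1.00, 1.50, 1.50, 0.93, 2.00),
  color = c("D", "D", "H", "H", "D", "H", "D", "H"),
  price = c(9000, 11000, 5000, 6000, 18000, 11000, 7000, 16000)
)

# Vanity points named in the text: 1, 1.5, 2 and 3 carats.
vanity_points <- c(1, 1.5, 2, 3)
at_vanity <- subset(vanity_demo, round(carat, 2) %in% vanity_points)

# Median price for each carat weight and color combination.
median_by_group <- aggregate(price ~ carat + color, data = at_vanity, FUN = median)
median_by_group
```

The same call with `clarity` in place of `color`, or with both on the right-hand side of the formula, would cover the other quality dimensions.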
Next, diamond quality and cut quality at the vanity points. There are easily visible thresholds (I1 clarity and K color) demonstrating benchmarks to exceed for a diamond to be viable on the market.